
Table 2: Comparison of [Sung and Poggio, 1994] and our system on Test Set B

System                                             Missed faces  Detect rate  False detects  False detect rate
10) Networks 1 and 2 -> AND(0) -> threshold(2,3)
    -> overlap elimination                                   34        78.1%              3          1/3226028
11) Networks 1 and 2 -> threshold(2,2)
    -> overlap elimination -> AND(2)                         20        87.1%             15           1/645206
12) Networks 1 and 2 -> threshold(2,2) -> overlap
    -> OR(2) -> threshold(2,1) -> overlap                    11        92.9%             64           1/151220
[Sung and Poggio, 1994] (Multi-layer network)                36        76.8%              5          1/1929655
[Sung and Poggio, 1994] (Perceptron)                         28        81.9%             13           1/742175
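The arbitration heuristics named in Table 2 can be sketched in a few lines. This is a minimal illustration under one reading of the notation, which is an assumption here: threshold(radius, votes) keeps a detection only if at least `votes` detections (including itself) fall within `radius` pixels of it, and AND(d) keeps a detection from one network only if the other network also fired within `d` pixels.

```python
def threshold_votes(detections, radius, min_votes):
    """Assumed reading of threshold(radius, votes): keep a detection
    only if at least min_votes detections lie within radius pixels."""
    def near(a, b):
        return abs(a[0] - b[0]) <= radius and abs(a[1] - b[1]) <= radius
    return [d for d in detections
            if sum(near(d, e) for e in detections) >= min_votes]

def and_networks(dets1, dets2, radius):
    """Assumed reading of AND(radius): keep a detection from network 1
    only if network 2 also fired within radius pixels of it."""
    def near(a, b):
        return abs(a[0] - b[0]) <= radius and abs(a[1] - b[1]) <= radius
    return [d for d in dets1 if any(near(d, e) for e in dets2)]
```

Chaining these filters more or less aggressively is what moves the system along the detect-rate versus false-detect trade-off shown in the table.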
Although our system is less computationally expensive
than [Sung and Poggio, 1994], the system described so far
is not real-time because of the number of windows which
must be classified. In the related task of license plate detection, [Umezaki, 1995] decreased the number of windows that must be processed. The key idea was to have the neural network be invariant to translations of about 25% of the size of a license plate. Instead of a single number indicating the
existence of a face in the window, the output of Umezaki’s
network is an image with a peak indicating where the net-
work believes a license plate is located. These outputs are
accumulated over the entire image, and peaks are extracted
to give candidate locations for license plates. In [Rowley et
al., 1995], we show that a face detection network can also
be made translation invariant. However, this translation in-
variant face detector makes many more false detections than
one that detects only centered faces. We use the centered
face detector to verify candidates found by the translation
invariant network. With this approach, we can process a
320x240 pixel image in less than 5 seconds on an SGI Indy
workstation. This technique is related, at a high level, to
the technique presented in [Vaillant et al., 1994].
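The two-stage scheme described above (accumulate the translation-invariant network's outputs into a map, extract peaks as candidate locations, then verify each candidate with the centered-face detector) can be sketched as follows. This is only an illustrative skeleton: `coarse_net` and `centered_net` are hypothetical stand-ins for the two trained networks, and the peak-extraction rule is a simple local-maximum test, not the paper's exact procedure.

```python
import numpy as np

def extract_candidates(heatmap, threshold=0.5):
    """Find local maxima in the accumulated output map of the
    translation-invariant network; peaks mark candidate locations."""
    candidates = []
    h, w = heatmap.shape
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            v = heatmap[y, x]
            # A candidate is a sufficiently strong 3x3 local maximum.
            if v >= threshold and v == heatmap[y-1:y+2, x-1:x+2].max():
                candidates.append((x, y, v))
    return candidates

def detect(image, coarse_net, centered_net, threshold=0.5):
    """Two-stage detection: the coarse network proposes candidates,
    and the centered-face detector verifies each one."""
    heatmap = coarse_net(image)
    return [(x, y) for x, y, _ in extract_candidates(heatmap, threshold)
            if centered_net(image, x, y)]
```

The speedup comes from the first stage: the expensive centered detector is applied only at the handful of peak locations rather than at every window position.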
5 Conclusions and future research
Our algorithm can detect between 78.9% and 90.5% of faces
in a set of 130 test images, with an acceptable number of
false detections. Depending on the application, the system
can be made more or less conservative by varying the arbi-
tration heuristics or thresholds used. The system has been
tested on a wide variety of images, with many faces and
unconstrained backgrounds.
There are a number of directions for future work. The
main limitation of the current system is that it only detects
upright faces looking at the camera. Separate versions of
the system could be trained for different head orientations,
and the results could be combined using arbitration methods similar to those presented here. Other methods of improving system performance include obtaining more positive examples for training, or applying more sophisticated image
preprocessing and normalization techniques. For instance,
the color segmentation method used in [Hunke, 1994] for color-based face tracking could be used to filter images. The face detector would then be applied only to portions of
the image which contain skin color, which would speed up
the algorithm as well as eliminate some false detections.
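A minimal sketch of this kind of skin-color prefiltering is shown below. The thresholds are illustrative placeholders, not the segmentation rule of [Hunke, 1994], and the window size and stride are arbitrary choices for the example.

```python
import numpy as np

def skin_mask(rgb):
    """Very rough skin-color mask on an RGB image (H x W x 3, uint8).
    The thresholds are illustrative placeholders only."""
    r = rgb[..., 0].astype(int)
    g = rgb[..., 1].astype(int)
    b = rgb[..., 2].astype(int)
    return (r > 95) & (g > 40) & (b > 20) & (r > g) & (r > b)

def windows_to_scan(rgb, win=20, step=10, min_skin_frac=0.25):
    """Yield only window positions whose skin-pixel fraction is high
    enough; the face detector then classifies just these windows."""
    mask = skin_mask(rgb)
    h, w = mask.shape
    for y in range(0, h - win + 1, step):
        for x in range(0, w - win + 1, step):
            if mask[y:y+win, x:x+win].mean() >= min_skin_frac:
                yield (x, y)
```

Windows that contain little skin color are skipped outright, which both reduces the number of classifications and removes some false detections on non-skin background.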
One application of this work is in the area of media tech-
nology. Every year, improved technology provides cheaper
and more efficient ways of storing information. However,
automatic high-level classification of the information con-
tent is very limited; this is a bottleneck preventing media
technology from reaching its full potential. The work de-
scribed above allows a user to make queries of the form
“Which scenes in this video contain human faces?” and to
have the query answered automatically.
Acknowledgements
The authors thank Kah-Kay Sung and Dr. Tomaso Pog-
gio (at MIT) and Dr. Woodward Yang (at Harvard) for
providing a series of test images and a mug-shot database,
respectively. Michael Smith (at CMU) provided some digi-
tized television images for testing purposes. We also thank
Eugene Fink, Xue-Mei Wang, Hao-Chi Wong, Tim Rowley,
and Kaari Flagstad for comments on drafts of this paper.
References
[Hunke, 1994] H. Martin Hunke. Locating and tracking of human faces with neural networks. Master’s thesis, University of
Karlsruhe, 1994.
[Le Cun et al., 1989] Y. Le Cun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D. Jackel. Backpropagation applied to handwritten zip code recognition. Neural Computation, 1:541–551, 1989.
[Rowley et al., 1995] Henry A. Rowley, Shumeet Baluja, and Takeo Kanade. Human face detection in visual scenes. CMU-CS-95-158R, Carnegie Mellon University, November 1995. Also available at http://www.cs.cmu.edu/~har/faces.html.
[Sung and Poggio, 1994] Kah-Kay Sung and Tomaso Poggio. Example-based learning for view-based human face detection. A.I. Memo 1521, CBCL Paper 112, MIT, December 1994.
[Umezaki, 1995] Tazio Umezaki. Personal communication,
1995.
[Vaillant et al., 1994] R. Vaillant, C. Monrocq, and Y. Le Cun.
Original approach for the localisation of objects in images. IEE
Proceedings on Vision, Image, and Signal Processing, 141(4),
August 1994.
[Waibel et al., 1989] Alex Waibel, Toshiyuki Hanazawa, Geoffrey Hinton, Kiyohiro Shikano, and Kevin J. Lang. Phoneme
recognition using time-delay neural networks. Readings in
Speech Recognition, pages 393–404, 1989.